MP(3C)

NAME
     mp: mp_block, mp_blocktime, mp_create, mp_destroy, mp_my_threadnum,
     mp_numthreads, mp_set_numthreads, mp_setup, mp_unblock, mp_setlock,
     mp_suggested_numthreads, mp_unsetlock, mp_barrier,
     mp_in_doacross_loop, mp_set_slave_stacksize - C multiprocessing
     utility functions

SYNOPSIS
     void mp_block()

     void mp_unblock()

     void mp_blocktime(iters)
     int iters

     void mp_setup()

     void mp_create(num)
     int num

     void mp_destroy()

     int mp_numthreads()

     void mp_set_numthreads(num)
     int num

     int mp_my_threadnum()

     int mp_is_master()

     void mp_setlock()

     void mp_unsetlock()

     void mp_barrier()

     int mp_in_doacross_loop()

     void mp_set_slave_stacksize(size)
     int size

     unsigned int mp_suggested_numthreads(num)
     unsigned int num

DESCRIPTION
     These routines give some measure of control
     over the parallelism used in C programs. They should not be needed
     by most users, but will help to tune specific applications.

     mp_block puts all slave threads to sleep via blockproc(2). This
     frees the processors for use by other jobs. This is useful if it is
     known that the slaves will not be needed for some time, and the
     machine is being shared by several users. Calls to mp_block may not
     be nested; a warning is issued if an attempt to do so is made.

     mp_unblock wakes up the slave threads that were previously blocked
     via mp_block. It is an error to unblock threads that are not
     currently blocked; a warning is issued if an attempt is made to do
     so. It is not necessary to explicitly call mp_unblock. When a
     parallel region is entered, a check is made, and if the slaves are
     currently blocked, a call is made to mp_unblock automatically.

     mp_blocktime controls the amount of time a slave thread waits for
     work before giving up. When enough time has elapsed, the slave
     thread blocks itself. This automatic blocking is independent of the
     user-level blocking provided by the mp_block/mp_unblock calls.
     Slave threads that have blocked themselves will be automatically
     unblocked upon entering a parallel region. The argument to
     mp_blocktime is the number of times to spin in the wait loop. By
     default, it is set to 10,000,000. This takes about 0.25 seconds on
     a 200MHz processor. As a special case, an argument of 0 disables
     the automatic blocking, and the slaves will spin-wait without
     limit. The environment variable MP_BLOCKTIME may be set to an
     integer value. It acts like an implicit call to mp_blocktime during
     program startup.

     mp_destroy deletes the slave threads. They are stopped by forcing
     them to call exit(2). In general, doing this is discouraged.
     mp_block can be used in most cases.

     mp_create creates and initializes threads. It creates enough
     threads so that the total number is equal to the argument. Since
     the calling thread already counts as one, mp_create will create one
     less than its argument in new slave threads.

     mp_setup also creates and initializes threads. It takes no
     arguments. It simply calls mp_create using the current default
     number of threads. Normally the default number is equal to the
     number of cpus currently on the machine. If the user has not called
     either of the thread creation routines already, then mp_setup is
     invoked automatically when the first parallel region is entered. If
     the environment variable MP_SETUP is set, then mp_setup is called
     during initialization, before any user code is executed.

     mp_numthreads returns the number of threads that would participate
     in an immediately following parallel region. If the threads have
     already been created, then it returns the current number of
     threads. If the threads have not been created, then it returns the
     current default number of threads. The count includes the master
     thread. Knowing this count can be useful in optimizing certain
     kinds of parallel loops by hand, but this function has the side
     effect of freezing the number of threads to the returned value. As
     a result, this routine should be used sparingly. To determine the
     number of threads without this side effect, see the description of
     mp_suggested_numthreads below.

     mp_set_numthreads sets the current default number of threads to the
     specified value. Note that this call does not directly create the
     threads; it only specifies the number that a subsequent mp_setup
     call should use.
     If the environment variable MP_SET_NUMTHREADS is set, it acts like
     an implicit call to mp_set_numthreads during program startup. For
     convenience when operating among several machines with different
     numbers of cpus, MP_SET_NUMTHREADS may be set to an expression
     involving integer literals, the binary operators + and -, the
     binary functions min and max, and the special symbolic value ALL,
     which stands for "the total number of available cpus on the current
     machine." Thus, something simple like

          setenv MP_SET_NUMTHREADS 7

     would set the number of threads to seven. This may be a fine choice
     on an 8-cpu machine, but would be very bad on a 4-cpu machine.
     Instead, use something like

          setenv MP_SET_NUMTHREADS "max(1,all-1)"

     which sets the number of threads to one less than the number of
     cpus on the current machine (but always at least one). If your
     configuration includes some machines with large numbers of cpus,
     setting an upper bound is a good idea. Something like

          setenv MP_SET_NUMTHREADS "min(all,4)"

     will request (no more than) 4 cpus. For compatibility with earlier
     releases, NUM_THREADS is supported as a synonym for
     MP_SET_NUMTHREADS.

     mp_my_threadnum returns an integer between 0 and n-1, where n is
     the value returned by mp_numthreads. The master process is always
     thread 0. This is occasionally useful for optimizing certain kinds
     of loops by hand.

     mp_is_master returns 1 if called by the master process, 0
     otherwise.

     mp_setlock provides convenient (though limited) access to the
     locking routines. The convenience is that no setup need be done; it
     may be called directly without any preliminaries. The limitation is
     that there is only one lock. It is analogous to the ussetlock(3P)
     routine, but it takes no arguments and does not return a value.
     This is useful for serializing access to shared variables (e.g.
     counters) in a parallel region.
     Note that it will frequently be necessary to declare those
     variables as volatile to ensure that the optimizer does not assign
     them to a register.

     mp_unsetlock is the companion routine for mp_setlock. It also takes
     no arguments and does not return a value.

     mp_barrier provides a simple interface to a single barrier(3P). It
     may be used inside a parallel loop to force a barrier
     synchronization to occur among the parallel threads. The routine
     takes no arguments, returns no value, and does not require any
     initialization.

     mp_in_doacross_loop answers the question "am I currently executing
     inside a parallel loop?" This is needed in certain rare situations
     where you have an external routine that can be called both from
     inside and from outside a parallel loop, and the routine must do
     different things depending on whether it is being called in
     parallel or not.

     mp_set_slave_stacksize sets the stack size (in bytes) to be used by
     the slave processes when they are created (via sprocsp(2)). The
     default size is 16MB. Note that slave processes only allocate their
     local data on their stack; shared data (even if allocated on the
     master's stack) is not counted.

     mp_suggested_numthreads uses the supplied value as a hint about how
     many threads to use in subsequent parallel regions, and returns the
     previous value of the number of threads to be employed in parallel
     regions. It does not affect currently executing parallel regions,
     if any. The implementation may ignore this hint depending on
     factors such as overall system load. This routine may also be
     called with the value 0, in which case it simply returns the number
     of threads to be employed in parallel regions, without the side
     effect present in mp_numthreads.
   Pragmas or directives
     The MIPSpro C (and C++) compiler allows you to apply the
     capabilities of a Silicon Graphics multiprocessor computer to the
     execution of a single job. By coding a few simple directives, the
     compiler splits the job into concurrently executing pieces, thereby
     decreasing the wall-clock run time of the job.

     Directives enable, disable, or modify a feature of the compiler.
     Essentially, directives are command line options specified within
     the input file instead of on the command line. Unlike command line
     options, directives have no default setting. To invoke a directive,
     you must either toggle it on or set a desired value for its level.

     The following directives can be used in C (and C++) programs when
     compiled with the -mp option.

     #pragma parallel
          This pragma denotes the start of a parallel region. The syntax
          for this pragma has a number of modifiers, but to run a single
          loop in parallel, the only modifiers you usually use are
          shared and local. These options tell the multiprocessing
          compiler which variables to share between all threads of
          execution and which variables should be treated as local. In
          C, the code that comprises the parallel region is delimited by
          curly braces ({ }) and immediately follows the parallel pragma
          and its modifiers.
          The syntax for this pragma is:

               #pragma parallel shared (variables)
               #pragma local (variables) optional modifiers
               { code }

          The parallel pragma has four modifiers: shared, local, if, and
          numthreads. Their definitions are:

          shared ( variable_names )
               Tells the multiprocessing C compiler the names of all the
               variables that the threads must share.

          local ( variable_names )
               Tells the multiprocessing C compiler the names of all the
               variables that must be private to each thread. (When PCA
               sets up a parallel region, it does this for you.)

          if ( integer_valued_expr )
               Lets you set up a condition that is evaluated at run time
               to determine whether to run the statement(s) serially or
               in parallel. At compile time, it is not always possible
               to judge how much work a parallel region does (for
               example, loop indices are often calculated from data
               supplied at run time). Avoid running trivial amounts of
               code in parallel, because you cannot make up the overhead
               associated with running code in parallel. PCA will also
               generate this condition as appropriate. If the if
               condition is false (equal to zero), then the statement(s)
               run serially. Otherwise, the statement(s) run in
               parallel.

          numthreads ( expr )
               Tells the multiprocessing C compiler the number of
               available threads to use when running this region in
               parallel. (The default is all the available threads.)
               In general, you should never have more threads of
               execution than you have processors, and you should
               specify numthreads with the MP_SET_NUMTHREADS environment
               variable at run time. If you want to run a loop in
               parallel while you run some other code, you can use this
               option to tell the multiprocessing C compiler to use only
               some of the available threads. The expression expr should
               evaluate to a positive integer.

          For example, to start a parallel region in which to run the
          following code in parallel:

               for (idx=n; idx; idx--) {
                   a[idx] = b[idx] + c[idx];
               }

          you must write:

               #pragma parallel shared( a, b, c ) shared(n) local( idx )

          or:

               #pragma parallel
               #pragma shared( a, b, c )
               #pragma shared(n)
               #pragma local(idx)

          before the statement or compound statement (code in curly
          braces, { }) that comprises the parallel region.

          Any code within a parallel region but not within any of the
          explicit parallel constructs (pfor, independent, one
          processor, and critical) is termed local code. Local code
          typically modifies only local data and is run by all threads.

     #pragma pfor
          The pfor is contained within a parallel region. Use #pragma
          pfor to run a for loop in parallel only if the loop meets all
          of these conditions:

          -  All the values of the index variable can be computed
             independently of the iterations.

          -  All iterations are independent of each other; that is, data
             used in one iteration does not depend on data created by
             another iteration. A quick test for independence: if the
             loop can be run backwards, then chances are good the
             iterations are independent.

          -  The loop control variable cannot be a field within a
             class/struct/union or an array element.
          -  The number of times the loop must be executed is determined
             once, upon entry to the loop, and is based on the loop
             initialization, loop test, and loop increment statements.

          If the number of times the loop is actually executed is
          different from what is computed above, the results are
          unpredictable. This can happen if the loop test and increment
          change during the execution of the loop, or if there is an
          early exit from within the for loop. An early exit or a change
          to the loop test and increment during execution may have
          serious performance implications. The test or the increment
          should not contain expressions with side effects. The
          chunksize, if specified, is computed before the loop is
          executed, and the behavior is unpredictable if its value
          changes within the loop.

          If you are writing a pfor loop for the multiprocessing C++
          compiler, the index variable i can be declared within the for
          statement via

               int i = 0;

          The draft C++ standard states that the scope of an index
          variable declared in a for statement extends to the end of the
          for statement, as in this example:

               #pragma pfor
               for (int i = 0; ...)

          The C++ compiler doesn't enforce this; in fact, with this
          compiler the scope extends to the end of the enclosing block.
          Use care when writing code so that the subsequent change in
          scope rules for i (in later compiler releases) does not affect
          the user code.

          If the code after a pfor is not dependent on the calculations
          made in the pfor loop, there is no reason to synchronize the
          threads of execution before they continue. So, if one thread
          from the pfor finishes early, it can go on to execute the
          serial code without waiting for the other threads to finish
          their part of the loop.
          The #pragma pfor directive takes several modifiers; the only
          one that is required is iterate. #pragma pfor tells the
          compiler that each iteration of the loop is unique. It also
          partitions the iterations among the threads for execution. The
          syntax for #pragma pfor is:

               #pragma pfor iterate ( ) optional_modifiers
               for ...
               { code ... }

          The pfor pragma has several modifiers. Their syntax is:

               iterate (index variable=expr1; expr2; expr3)
               local (variable list)
               lastlocal (variable list)
               reduction (variable list)
               affinity (variable) = thread (expression)
               schedtype (type)
               chunksize (expr)

          Where:

          iterate (index variable=expr1; expr2; expr3)
               Gives the multiprocessing C compiler the information it
               needs to identify the unique iterations of the loop and
               partition them to particular threads of execution. index
               variable is the index variable of the for loop you want
               to run in parallel. expr1 is the starting value for the
               loop index.
               expr2 is the number of iterations for the loop you want
               to run in parallel. expr3 is the increment of the for
               loop you want to run in parallel.

          local (variable list)
               Specifies variables that are local to each process. If a
               variable is declared as local, each iteration of the loop
               is given its own uninitialized copy of the variable. You
               can declare a variable as local if its value does not
               depend on any other iteration of the loop and if its
               value is used only within a single iteration. In effect
               the local variable is just temporary; a new copy can be
               created in each loop iteration without changing the final
               answer.

          lastlocal (variable list)
               Specifies variables that are local to each process.
               Unlike with the local clause, the compiler saves only the
               value of the logically last iteration of the loop when it
               exits.

          reduction (variable list)
               Specifies variables involved in a reduction operation. In
               a reduction operation, the compiler keeps local copies of
               the variables and combines them when it exits the loop.
               An element of the reduction list must be an individual
               variable (also called a scalar variable) and cannot be an
               array or struct. However, it can be an individual element
               of an array. When the reduction modifier is used, it
               appears in the list with the correct subscripts. One
               element of an array can be used in a reduction operation,
               while other elements of the array are used in other ways.
               To allow for this, if an element of an array appears in
               the reduction list, the entire array can also appear in
               the share list. The two types of reductions supported are
               sum (+) and product (*).
               The compiler confirms that the reduction expression is
               legal by making some simple checks. The compiler does
               not, however, check all statements in the loop for
               illegal reductions. You must ensure that the reduction
               variable is used correctly in a reduction operation.

          affinity (variable) = thread (expression)
               The effect of thread affinity is to execute iteration i
               on the thread number given by the user-supplied
               expression (modulo the number of threads). Since the
               threads may need to evaluate this expression in each
               iteration of the loop, the variables used in the
               expression (other than the loop induction variable) must
               be declared shared and must not be modified during the
               execution of the loop. Violating these rules may lead to
               incorrect results. If the expression does not depend on
               the loop induction variable, then all iterations will
               execute on the same thread, and will not benefit from
               parallel execution.

          schedtype (type)
               Tells the multiprocessing C compiler how to share the
               loop iterations among the processors. The schedtype
               chosen depends on the type of system you are using and
               the number of programs executing. You can use the
               following valid types to modify schedtype:

               simple (the default)
                    Tells the run time scheduler to partition the
                    iterations evenly among all the available threads.

               runtime
                    Tells the compiler that the real schedule type will
                    be specified at run time.

               dynamic
                    Tells the run time scheduler to give each thread
                    chunksize iterations of the loop. chunksize should
                    be smaller than (number of total iterations)/(number
                    of threads). The advantage of dynamic over simple is
                    that dynamic helps distribute the work more evenly
                    than simple.
                    Depending on the data, some iterations of a loop can
                    take longer to compute than others, so some threads
                    may finish long before the others. In this
                    situation, if the iterations are distributed by
                    simple, then a thread that finishes early waits for
                    the others. But if the iterations are distributed by
                    dynamic, such a thread doesn't wait; it goes back to
                    get another chunksize iterations, until all the
                    iterations of the loop have been run.

               interleave
                    Tells the run time scheduler to give each thread
                    chunksize iterations (described below) of the loop,
                    which are then assigned to the threads in an
                    interleaved way.

               gss (guided self-scheduling)
                    Tells the run time scheduler to give each processor
                    a varied number of iterations of the loop. This is
                    like dynamic, but instead of a fixed chunksize, the
                    pieces begin big and end small. If I iterations
                    remain and P threads are working on them, the piece
                    size is roughly:

                         I/(2P) + 1

                    Programs with triangular matrices should use gss.

          chunksize (expr)
               Tells the multiprocessing C/C++ compiler how many
               iterations to define as a chunk when you use the dynamic
               or interleave modifier (described above). expr should be
               a positive integer, and should evaluate to the following
               formula:

                    number of iterations / X

               where X is between twice and ten times the number of
               threads. Select twice the number of threads when
               iterations vary slightly. Reduce the chunk size to
               reflect increasing variance in the iterations.
               Performance gains may diminish after increasing X to ten
               times the number of threads.
     #pragma one processor
          A #pragma one processor directive causes the statement that
          follows it to be executed by exactly one thread. The syntax of
          this pragma is:

               #pragma one processor
               { code }

     #pragma critical
          Sometimes the bulk of the work done by a loop can be done in
          parallel, but the entire loop cannot run in parallel because
          of a single data-dependent statement. Often, you can move such
          a statement out of the parallel region. When that is not
          possible, you can sometimes use a lock on the statement to
          preserve the integrity of the data.

          In the multiprocessing C/C++ compiler, use the critical pragma
          to put a lock on a critical statement (or compound statement
          using { }). When you put a lock on a statement, only one
          thread at a time can execute that statement. If one thread is
          already working on a critical-protected statement, any other
          thread that wants to execute that statement must wait until
          that thread has finished executing it. The syntax of the
          critical pragma is:

               #pragma critical (lock_variable)
               { code }

          The statement(s) after the critical pragma will be executed by
          all threads, one at a time. The lock variable lock_variable is
          an optional integer variable that must be initialized to zero.
          The parentheses are required. If you don't specify a lock
          variable, the compiler automatically supplies one. Multiple
          critical constructs inside the same parallel region are
          considered to be independent of each other unless they use the
          same explicit lock variable.

     #pragma independent
          Running a loop in parallel is a class of parallelism sometimes
          called fine-grained parallelism or homogeneous parallelism.
          It is called homogeneous because all the threads execute the
          same code on different data. Another class of parallelism is
          called coarse-grained parallelism or heterogeneous
          parallelism. As the name suggests, the code in each thread of
          execution is different.

          Ensuring data independence for heterogeneous code executed in
          parallel is not always as easy as it is for homogeneous code
          executed in parallel. (Ensuring data independence for
          homogeneous code is not a trivial task.)

          The independent pragma has no modifiers. Use this pragma to
          tell the multiprocessing C/C++ compiler to run code in
          parallel with the rest of the code in the parallel region. The
          syntax for #pragma independent is:

               #pragma independent
               { code }

   Synchronization Directives
     To account for data dependencies, it is sometimes necessary for
     threads to wait for all other threads to complete executing an
     earlier section of code. Two sets of directives implement this
     coordination: #pragma synchronize and #pragma enter/exit gate.

     #pragma synchronize
          A #pragma synchronize tells the multiprocessing C/C++ compiler
          that within a parallel region, no thread can execute the
          statements that follow this pragma until all threads have
          reached it. This directive is a classic barrier construct. The
          syntax for this pragma is:

               #pragma synchronize

     #pragma enter gate
     #pragma exit gate
          You can use two additional pragmas to coordinate the
          processing of code within a parallel region.
          These additional pragmas work as a matched set: #pragma enter
          gate and #pragma exit gate. A gate is a special barrier. No
          thread can exit the gate until all threads have entered it.
          This construct gives you more flexibility when managing
          dependencies between the work-sharing constructs within a
          parallel region.

          For example, construct D may be dependent on construct A, and
          construct F may be dependent on construct B. However, you do
          not want to stop at construct D just because all the threads
          have not cleared B. By using enter/exit gate pairs, you can
          make subtle distinctions about which construct is dependent on
          which other construct.

          The syntax of the enter gate pragma is:

               #pragma enter gate

          Put this pragma after the work-sharing construct that all
          threads must clear before the #pragma exit gate of the same
          name.

          The syntax of the exit gate pragma is:

               #pragma exit gate

          Put this pragma before the work-sharing construct that is
          dependent on the preceding #pragma enter gate. No thread
          enters this work-sharing construct until all threads have
          cleared the work-sharing construct controlled by the
          corresponding #pragma enter gate.

     #pragma page_place
          The syntax of this pragma is:

               #pragma page_place (addr, size, threadnum)

          where addr is the starting address, size is the size in bytes,
          and threadnum is the thread.
          On a system with physically distributed shared memory (for
          example, Origin2000), you can explicitly place all data pages
          spanned by the virtual address range [addr, addr + size - 1]
          in the physical memory of the processor corresponding to the
          specified thread.

SEE ALSO
     cc(1), f77(1), mp(3f), sync(3c), sync(3f),
     MIPSpro Power C Programmer's Guide,
     MIPSpro C Language Reference Manual,
     MIPSpro FORTRAN 77 Programmer's Guide